| Statistic | fixed.acidity | volatile.acidity | citric.acid | residual.sugar |
|---|---|---|---|---|
| Min. | 4.60 | 0.1200 | 0.000 | 0.900 |
| 1st Qu. | 7.10 | 0.3900 | 0.090 | 1.900 |
| Median | 7.90 | 0.5200 | 0.260 | 2.200 |
| Mean | 8.32 | 0.5278 | 0.271 | 2.539 |
| 3rd Qu. | 9.20 | 0.6400 | 0.420 | 2.600 |
| Max. | 15.90 | 1.5800 | 1.000 | 15.500 |

| Statistic | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density |
|---|---|---|---|---|
| Min. | 0.01200 | 1.00 | 6.00 | 0.9901 |
| 1st Qu. | 0.07000 | 7.00 | 22.00 | 0.9956 |
| Median | 0.07900 | 14.00 | 38.00 | 0.9968 |
| Mean | 0.08747 | 15.87 | 46.47 | 0.9967 |
| 3rd Qu. | 0.09000 | 21.00 | 62.00 | 0.9978 |
| Max. | 0.61100 | 72.00 | 289.00 | 1.0037 |

| Statistic | pH | sulphates | alcohol |
|---|---|---|---|
| Min. | 2.740 | 0.3300 | 8.40 |
| 1st Qu. | 3.210 | 0.5500 | 9.50 |
| Median | 3.310 | 0.6200 | 10.20 |
| Mean | 3.311 | 0.6581 | 10.42 |
| 3rd Qu. | 3.400 | 0.7300 | 11.10 |
| Max. | 4.010 | 2.0000 | 14.90 |

| quality | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|
| Count | 10 | 53 | 681 | 638 | 199 | 18 |

| rating | bad | average | good |
|---|---|---|---|
| Count | 63 | 1319 | 217 |

First, I will begin by exploring each variable to get a better sense of the data and how we can use it.
The distribution of fixed acidity is close to normal with a slight positive skew. It has some outliers near 16.
The distribution of volatile acidity seems slightly bimodal, with peaks around 0.4 and 0.6. There are some extreme outliers above 0.8, with a few as high as 1.5.
For citric acid, the highest density is at zero, followed by spikes around 0.02, 0.2 and 0.45. These spikes could reflect popular citric acid levels that most people prefer in their wine.
For residual sugar, most of the density is around 2, with outliers from 4 up to about 15.
Free sulfur dioxide has a positively skewed distribution with a high concentration around 6.
Total sulfur dioxide also looks positively skewed, with some extreme outliers near 300.
Density looks normally distributed.
The values for pH seem to be normally distributed, with a few far outliers near the extremes of the observed range (2.74 and 4.01).
Sulphates look roughly normal with a mean around 0.7 and outliers all the way up to 2. These could be special wines that cater to niche tastes.
Alcohol seems to have a similar distribution to sulphur. There could be a relationship there that is worth exploring.
Chlorides are concentrated around 0.08, where the majority of observations lie, with some extreme outliers.
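The outlier remarks above can be checked against the quartiles in the summary table. The following is a minimal sketch, assuming Tukey's 1.5 × IQR rule as the outlier criterion (the text does not say which rule, if any, was used). The data frame `quartiles` and its column names are illustrative; the numbers are copied from the summary.

```r
# Quartiles and maxima copied from the summary table for three variables.
quartiles <- data.frame(
  variable = c("fixed.acidity", "residual.sugar", "total.sulfur.dioxide"),
  q1  = c(7.1, 1.9, 22),
  q3  = c(9.2, 2.6, 62),
  max = c(15.9, 15.5, 289)
)

# Tukey's rule (an assumption here): values above Q3 + 1.5 * IQR are flagged.
iqr <- quartiles$q3 - quartiles$q1
quartiles$upper_fence <- quartiles$q3 + 1.5 * iqr
quartiles$max_is_outlier <- quartiles$max > quartiles$upper_fence

print(quartiles)
```

Under this rule the upper fences are roughly 12.35, 3.65 and 122, so the maximum of each of these three variables lies well beyond its fence, which agrees with the long right tails described above.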
The majority of the data is centered around average quality (scores of 5 and 6). This imbalance could make models that predict quality less reliable.
Rating values are determined as follows:

- Bad: Quality < 5
- Average: 5 <= Quality < 7
- Good: 7 <= Quality < 10
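The rating rule can be sketched with base R's `cut()`. This is an illustration rather than the code used for the analysis: the `quality` vector is rebuilt from the counts in the summary table, and the open upper bound (`Inf`) is an assumption that stands in for the stated limit of 10.

```r
# Quality scores rebuilt from the counts in the summary table (scores 3 to 8).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# right = FALSE gives left-closed intervals [-Inf, 5), [5, 7), [7, Inf),
# which matches "bad < 5", "5 <= average < 7", "good >= 7".
rating <- cut(quality,
              breaks = c(-Inf, 5, 7, Inf),
              labels = c("bad", "average", "good"),
              right = FALSE)

table(rating)
# bad: 63, average: 1319, good: 217
```

The resulting counts match the rating column of the summary, which suggests the rule is applied as written.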
In the dataset, there are 1599 observations of wine. It contains 12 features, of which only quality is categorical. The other variables are numerical and describe the properties of the wine.
Other observations: most wines are of average quality, with scores of 5 or 6. This makes the dataset unbalanced, with far fewer observations of fine wine than of average wine. This is understandable, since fine wine is supposed to be rarer than average wine, but it gives us less data for drawing conclusions and making predictions about fine or even bad wines.
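To put a number on the imbalance, the quality counts from the summary can be turned into shares. This is a small sketch using only those counts; the names `counts`, `shares` and `middle_share` are illustrative.

```r
# Observations per quality score, copied from the summary table.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of the 1599 observations at each score.
shares <- round(prop.table(counts), 3)
print(shares)

# Combined share of the two middle scores (5 and 6).
middle_share <- sum(counts[c("5", "6")]) / sum(counts)
print(round(middle_share, 3))
```

Scores 5 and 6 together account for about 82% of the observations, which leaves relatively few wines at either end of the scale.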
The main feature in the dataset is quality. We could try to determine what constitutes the quality of a wine based on its physical properties.
There are variables that define taste, like citric acid, pH and sugar. I am interested in seeing which factors of taste affect the quality of the wine. There is also the alcohol level of the wine and how it might affect quality.
I created a rating variable that categorizes quality to improve visualization.
There was no unusual data in the dataset.
Pearson correlation coefficients (n = 1599):

| | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed.acidity | 1.00 | -0.26 | 0.67 | 0.11 | 0.09 | -0.15 | -0.11 | 0.67 | -0.68 | 0.18 | -0.06 | 0.12 |
| volatile.acidity | -0.26 | 1.00 | -0.55 | 0.00 | 0.06 | -0.01 | 0.08 | 0.02 | 0.23 | -0.26 | -0.20 | -0.39 |
| citric.acid | 0.67 | -0.55 | 1.00 | 0.14 | 0.20 | -0.06 | 0.04 | 0.36 | -0.54 | 0.31 | 0.11 | 0.23 |
| residual.sugar | 0.11 | 0.00 | 0.14 | 1.00 | 0.06 | 0.19 | 0.20 | 0.36 | -0.09 | 0.01 | 0.04 | 0.01 |
| chlorides | 0.09 | 0.06 | 0.20 | 0.06 | 1.00 | 0.01 | 0.05 | 0.20 | -0.27 | 0.37 | -0.22 | -0.13 |
| free.sulfur.dioxide | -0.15 | -0.01 | -0.06 | 0.19 | 0.01 | 1.00 | 0.67 | -0.02 | 0.07 | 0.05 | -0.07 | -0.05 |
| total.sulfur.dioxide | -0.11 | 0.08 | 0.04 | 0.20 | 0.05 | 0.67 | 1.00 | 0.07 | -0.07 | 0.04 | -0.21 | -0.19 |
| density | 0.67 | 0.02 | 0.36 | 0.36 | 0.20 | -0.02 | 0.07 | 1.00 | -0.34 | 0.15 | -0.50 | -0.17 |
| pH | -0.68 | 0.23 | -0.54 | -0.09 | -0.27 | 0.07 | -0.07 | -0.34 | 1.00 | -0.20 | 0.21 | -0.06 |
| sulphates | 0.18 | -0.26 | 0.31 | 0.01 | 0.37 | 0.05 | 0.04 | 0.15 | -0.20 | 1.00 | 0.09 | 0.25 |
| alcohol | -0.06 | -0.20 | 0.11 | 0.04 | -0.22 | -0.07 | -0.21 | -0.50 | 0.21 | 0.09 | 1.00 | 0.48 |
| quality | 0.12 | -0.39 | 0.23 | 0.01 | -0.13 | -0.05 | -0.19 | -0.17 | -0.06 | 0.25 | 0.48 | 1.00 |

P-values (diagonal entries are blank):

| | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed.acidity | | 0.0000 | 0.0000 | 0.0000 | 0.0002 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0136 | 0.0000 |
| volatile.acidity | 0.0000 | | 0.0000 | 0.9389 | 0.0142 | 0.6747 | 0.0022 | 0.3788 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| citric.acid | 0.0000 | 0.0000 | | 0.0000 | 0.0000 | 0.0147 | 0.1555 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| residual.sugar | 0.0000 | 0.9389 | 0.0000 | | 0.0262 | 0.0000 | 0.0000 | 0.0000 | 0.0006 | 0.8252 | 0.0926 | 0.5832 |
| chlorides | 0.0002 | 0.0142 | 0.0000 | 0.0262 | | 0.8241 | 0.0581 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| free.sulfur.dioxide | 0.0000 | 0.6747 | 0.0147 | 0.0000 | 0.8241 | | 0.0000 | 0.3805 | 0.0049 | 0.0389 | 0.0055 | 0.0428 |
| total.sulfur.dioxide | 0.0000 | 0.0022 | 0.1555 | 0.0000 | 0.0581 | 0.0000 | | 0.0044 | 0.0078 | 0.0860 | 0.0000 | 0.0000 |
| density | 0.0000 | 0.3788 | 0.0000 | 0.0000 | 0.0000 | 0.3805 | 0.0044 | | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| pH | 0.0000 | 0.0000 | 0.0000 | 0.0006 | 0.0000 | 0.0049 | 0.0078 | 0.0000 | | 0.0000 | 0.0000 | 0.0210 |
| sulphates | 0.0000 | 0.0000 | 0.0000 | 0.8252 | 0.0000 | 0.0389 | 0.0860 | 0.0000 | 0.0000 | | 0.0002 | 0.0000 |
| alcohol | 0.0136 | 0.0000 | 0.0000 | 0.0926 | 0.0000 | 0.0055 | 0.0000 | 0.0000 | 0.0000 | 0.0002 | | 0.0000 |
| quality | 0.0000 | 0.0000 | 0.0000 | 0.5832 | 0.0000 | 0.0428 | 0.0000 | 0.0000 | 0.0210 | 0.0000 | 0.0000 | |

We can see several correlated pairs of variables that would be interesting to explore further, such as alcohol level and rating. Fixed acidity and density also show a strong correlation (0.67). Some correlations are to be expected, like quality and rating, or free sulfur dioxide and total sulfur dioxide (0.67).
The variables that correlate most strongly with quality are:

- Alcohol (0.48)
- Volatile acidity (-0.39)
- Sulphates (0.25)
- Citric acid (0.23)

We will explore those variables with quality.
It is interesting to note that residual sugar has almost no correlation with quality (0.01, p = 0.58).
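To see how a P value relates to a correlation coefficient at this sample size, here is a minimal sketch of the usual t-test for a Pearson correlation. I assume the P values reported above come from a test of this kind; the helper name `p_from_r` is made up for illustration. Because the coefficients are rounded to two decimals, the results only approximate the reported P values.

```r
# Two-sided p-value for a Pearson correlation r from n observations,
# using t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom.
p_from_r <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

p_alcohol <- p_from_r(0.48, 1599)  # alcohol vs quality
p_sugar   <- p_from_r(0.01, 1599)  # residual sugar vs quality (r is rounded)

print(c(alcohol = p_alcohol, residual.sugar = p_sugar))
```

With 1599 observations even modest correlations are highly significant, so the size of a coefficient is more informative here than its P value.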
This is a very interesting correlation. It seems that most high-quality wines have high alcohol levels. Average wine (where most of our data lies) has some outliers in alcohol level. Because of those outliers, we cannot rely on alcohol alone to predict quality.
Here we see a small upward trend between sulphates and quality. However, even more than with alcohol, there is a large number of outliers that will throw off any predictions.
There is a clear positive trend between citric acid and quality. The vast majority of low-quality wines had very little citric acid, while highly rated wines had more. However, there are outliers with a high rating but almost no citric acid. My guess is that they are special wines that cater to niche tastes.
Volatile acidity has a clear negative correlation with quality (-0.39). While there are some outliers, this seems like a good predictive variable for quality.
The strong linear correlation between fixed acidity and density (0.67) is worth noting. Acidity seems to be a very strong factor for wine density.
This relationship is to be expected: the more citric acid a wine contains, the more acidic it is and the lower its pH.
Alcohol seems to be one of the best predictors, even though it suffers from outliers that can throw off the model; most highly rated wines have high levels of alcohol. Sulphates have a positive correlation with quality, but they suffer from a large number of outliers that could throw off the model. Citric acid has a clear positive trend with quality and seems like a promising predictor, although there are some outliers and its correlation with quality (0.23) is modest. Volatile acidity has a negative correlation with wine quality. It does suffer from outliers, but the trend is quite obvious.
Fixed acidity and density behaved as expected, with a linear relationship. pH and citric acid also behaved linearly, as expected.
The relationship between total.sulfur.dioxide and free.sulfur.dioxide (0.67) is also expected, as the two largely explain each other.
In this section we will explore the relationships between pairs of variables and quality at the same time.
Since alcohol has one of our strongest relationships with quality, it is worth exploring how it correlates with other variables.
There seems to be a small correlation between chlorides and quality when alcohol is held constant. What I find surprising is the spike in chloride values at low alcohol levels. I wonder if it is related to the fermentation process somehow.
There seems to be little correlation between total sulfur dioxide and either alcohol or quality.
There does not seem to be a correlation between density and quality.
There is almost no correlation between citric acid and alcohol. However, there is a trend between citric acid and rating that further confirms our previous suspicion.
Since citric acid seems like a strong variable in predicting the quality of wine, it's worth exploring how it correlates with other variables and quality.
In this plot, volatile acidity does not seem to have any clear relationship with rating, but it shows a downward trend as citric acid increases.
Very little relationship spotted here.
Now that we have explored the variables, we will attempt to build a prediction model for the quality of the wine.
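The general workflow can be sketched as follows. This is an illustration under stated assumptions, not the code behind the reported models: the data frame `wine` is synthetic (only the variable ranges are borrowed from the summary), the 70/30 split is inferred from the 1119 training rows out of 1599 observations, and the seed and the simulated coefficients are arbitrary.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Synthetic stand-in for the wine data; the real values are not reproduced here.
n <- 1599
wine <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)
wine$quality <- 5.5 + 0.3 * (wine$alcohol - 10) -
  1.2 * (wine$volatile.acidity - 0.5) + rnorm(n, sd = 0.65)

# Hold out 30% of the rows for testing (assumed split).
train_idx     <- sample(seq_len(n), size = floor(0.7 * n))
training_data <- wine[train_idx, ]
test_data     <- wine[-train_idx, ]

# Fit on the training rows only.
fit <- lm(quality ~ alcohol + volatile.acidity, data = training_data)
print(summary(fit)$r.squared)

# Evaluate on the held-out rows with root mean squared error.
pred <- predict(fit, newdata = test_data)
rmse <- sqrt(mean((test_data$quality - pred)^2))
print(rmse)
```

Evaluating on held-out rows gives a more honest picture of predictive power than the training R-squared alone.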
Calls:

- m1: `lm(formula = as.numeric(quality) ~ alcohol, data = training_data)`
- m2: `lm(formula = as.numeric(quality) ~ alcohol + citric.acid, data = training_data)`
- m3: `lm(formula = as.numeric(quality) ~ alcohol + citric.acid + sulphates, data = training_data)`
- m4: `lm(formula = as.numeric(quality) ~ alcohol + citric.acid + sulphates + volatile.acidity, data = training_data)`
- m5: `lm(formula = as.numeric(quality) ~ alcohol + citric.acid + sulphates + chlorides, data = training_data)`

Standard errors are shown in parentheses below each coefficient.

| | m1 | m2 | m3 | m4 | m5 |
|---|---|---|---|---|---|
| (Intercept) | -0.192 | -0.186 | -0.612** | 0.679** | -0.279 |
| | (0.210) | (0.206) | (0.212) | (0.237) | (0.220) |
| alcohol | 0.368*** | 0.348*** | 0.341*** | 0.312*** | 0.314*** |
| | (0.020) | (0.020) | (0.020) | (0.019) | (0.020) |
| citric.acid | | 0.737*** | 0.508*** | -0.131 | 0.574*** |
| | | (0.107) | (0.110) | (0.122) | (0.110) |
| sulphates | | | 0.847*** | 0.691*** | 1.071*** |
| | | | (0.126) | (0.122) | (0.133) |
| volatile.acidity | | | | -1.346*** | |
| | | | | (0.130) | |
| chlorides | | | | | -2.498*** |
| | | | | | (0.502) |
| R-squared | 0.231 | 0.263 | 0.291 | 0.354 | 0.307 |
| adj. R-squared | 0.230 | 0.261 | 0.289 | 0.351 | 0.304 |
| sigma | 0.705 | 0.691 | 0.678 | 0.647 | 0.671 |
| F | 335.604 | 198.731 | 152.640 | 152.474 | 123.096 |
| p | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Log-likelihood | -1195.976 | -1172.512 | -1150.450 | -1098.668 | -1138.169 |
| Deviance | 555.514 | 532.699 | 512.103 | 466.834 | 500.984 |
| AIC | 2397.951 | 2353.024 | 2310.901 | 2209.336 | 2288.339 |
| BIC | 2413.012 | 2373.104 | 2336.002 | 2239.457 | 2318.460 |
| N | 1119 | 1119 | 1119 | 1119 | 1119 |

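Based on the R-squared values, model 4 is our best option: it has the highest R-squared (0.354) and adjusted R-squared (0.351), as well as the lowest AIC and BIC.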
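The comparison can be reproduced from the reported fit statistics alone. The sketch below copies the log-likelihood, AIC and adjusted R-squared values from the model table into an illustrative data frame `fits`, picks the model with the lowest AIC, and checks the reported AIC values against the standard formula for a Gaussian linear model, AIC = -2 * logLik + 2 * k, where k counts the coefficients plus the error variance.

```r
# Fit statistics copied from the model table.
fits <- data.frame(
  model  = c("m1", "m2", "m3", "m4", "m5"),
  n_coef = c(2, 3, 4, 5, 5),  # coefficients including the intercept
  adj_r2 = c(0.230, 0.261, 0.289, 0.351, 0.304),
  loglik = c(-1195.976, -1172.512, -1150.450, -1098.668, -1138.169),
  aic    = c(2397.951, 2353.024, 2310.901, 2209.336, 2288.339)
)

# Recompute AIC; k = number of coefficients + 1 for the error variance.
aic_check <- -2 * fits$loglik + 2 * (fits$n_coef + 1)
print(round(aic_check - fits$aic, 3))  # differences are rounding noise

# Lower AIC and higher adjusted R-squared both favour the same model.
best_by_aic    <- fits$model[which.min(fits$aic)]
best_by_adj_r2 <- fits$model[which.max(fits$adj_r2)]
print(c(aic = as.character(best_by_aic), adj_r2 = as.character(best_by_adj_r2)))
```

Both criteria penalize extra terms, so the fact that they agree with plain R-squared suggests that adding volatile acidity is a real improvement rather than an artifact of having more predictors.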
High alcohol content and high citric acid seem to be important for good wine. Note, however, that the citric acid coefficient is no longer significant in model 4 once volatile acidity is included (-0.131, standard error 0.122).
There seems to be a correlation between chlorides and alcohol (-0.22). I wonder if it is a byproduct of the fermentation process.
The predictive power of this model is not very strong. Outliers are part of the reason, but I suspect the main cause of the poor predictability is how little data we have on poor or excellent wine. Most of the data concerns average wine.
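One detail of the model calls is worth checking. The summary lists quality as counts per score, which suggests it is stored as a factor. If that is the case (an assumption, since the data preparation is not shown), `as.numeric(quality)` returns the internal level codes 1 to 6 rather than the scores 3 to 8. A small sketch of the difference:

```r
# A factor with the same levels as the quality scores in the summary.
quality <- factor(c(3, 5, 6, 8), levels = 3:8)

codes  <- as.numeric(quality)                # level codes: 1 3 4 6
scores <- as.numeric(as.character(quality))  # original scores: 3 5 6 8

print(codes)
print(scores)
print(scores - codes)                        # constant offset of 2
```

Because the levels are consecutive, the codes differ from the scores by a constant 2, so the slopes and R-squared would be unaffected and only the intercept would shift. That would be consistent with the small intercepts in the model table, but predictions would need 2 added back to be read on the original quality scale.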
Here we revisit some of the interesting plots.
While most fine wine has a higher concentration of alcohol, as is evident from the graph, the distinction between average and bad wine in terms of alcohol concentration is not as pronounced. A high concentration of alcohol clearly appears to be important for fine wine, but considering how little data we have on fine wine, this could just be sampling noise.
This correlation is interesting because of the spike in chloride concentration at low alcohol levels. This could be because of the way that alcohol level is achieved. However, I can't see a clear relationship between chlorides and quality.
The linear model explains only around 35% of the variance in quality, based on its R-squared value of 0.354. This is quite a poor model to depend on for quality prediction, likely because of the lack of data on fine or poor wine.
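To make the 35% figure more concrete, the reported deviance (the residual sum of squares for a linear model) can be converted into a typical prediction error. The sketch below uses only numbers from the model table; the variable names are illustrative, and the total sum of squares is backed out from rounded R-squared values, so it is approximate.

```r
# Residual sum of squares (Deviance row) and R-squared, copied from the model table.
rss    <- c(m1 = 555.514, m2 = 532.699, m3 = 512.103, m4 = 466.834, m5 = 500.984)
r2     <- c(m1 = 0.231,   m2 = 0.263,   m3 = 0.291,   m4 = 0.354,   m5 = 0.307)
n_coef <- c(m1 = 2, m2 = 3, m3 = 4, m4 = 5, m5 = 5)  # including the intercept
n      <- 1119

# Residual standard error: sqrt(RSS / residual degrees of freedom).
sigma <- sqrt(rss / (n - n_coef))
print(round(sigma, 3))

# Approximate standard deviation of the response, from TSS = RSS / (1 - R^2).
tss  <- rss / (1 - r2)
sd_y <- sqrt(mean(tss) / (n - 1))
print(round(sd_y, 2))
```

The recomputed residual standard errors reproduce the sigma row of the table. The response itself has a standard deviation of roughly 0.80, so even the best model only reduces the typical error to about 0.65 points.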
The wine dataset contains physical as well as sensory properties of the wine. Chemical properties of each wine are recorded along with an overall rating. We chose quality as the primary factor of our analysis.
I started by examining the individual properties in the dataset. Most were either normally distributed or long-tailed. However, what stood out immediately was how much of the data was on wine of average quality. This cast a shadow over the analysis, as it lowers the confidence level of the results. Due to the lack of data on fine and poor wine, there was very little to go on for predicting trends.
After running a Pearson correlation test, it was apparent that these variables correlated the most with quality:

- Alcohol
- Sulphates
- Citric acid
- Volatile acidity

When exploring the variables further against quality, two variables stood out as correlating more strongly with quality: alcohol and citric acid. However, both showed strong outliers, further lowering my expectations for a strong prediction model. It was interesting to note that many observations had zero citric acid. That is expected, since citric acid is sometimes added to wine for taste.
In the multivariate analysis, nothing stood out except for alcohol versus chlorides. At low alcohol levels, there is a spike in chloride levels. While it shows little correlation with quality, that spike must either be related somehow to the fermentation process used to reach that particular alcohol level or be an anomaly in the data.
I proceeded to build a linear regression model even though my confidence in the predictability provided by the data was low. As expected, the model could only explain 35% of the variance.
Sadly, the dataset is not rich enough for strong analysis. It needs many more observations of fine wine and poor wine. Also, the quality rating is a subjective value, and it is not obvious how it was obtained or documented.